
Mkulakow/responses api #4039

Open
michalkulakowski wants to merge 24 commits into main from mkulakow/responses_api

Conversation

@michalkulakowski
Collaborator

🛠 Summary

JIRA/Issue if applicable.
Describe the changes.

🧪 Checklist

  • Unit tests added.
  • Documentation updated.
  • Change follows security best practices.

Copilot AI review requested due to automatic review settings March 4, 2026 14:50
Contributor

Copilot AI left a comment


Pull request overview

This PR adds initial support for the OpenAI Responses API to OVMS LLM serving, wiring a new /v3/responses endpoint through request parsing and unary response serialization, with accompanying unit tests.

Changes:

  • Introduces a new Endpoint::RESPONSES and routes /v3/responses + /v3/v1/responses to it.
  • Adds Responses-specific request parsing (input, max_output_tokens, conflict check with max_completion_tokens) and unary response JSON serialization (object: "response", output: [...]).
  • Extends unit tests to cover Responses parsing and unary response structure.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File descriptions:

  • src/test/http_openai_handler_test.cpp: Adds unit tests validating Responses parsing and unary response output structure.
  • src/llm/servable.cpp: Routes /v3/responses to the new endpoint and prepares inputs for it.
  • src/llm/apis/openai_completions.hpp: Adds Endpoint::RESPONSES and declares parseResponsesPart().
  • src/llm/apis/openai_completions.cpp: Implements Responses parsing and a new unary serialization path for Responses output format.
Comments suppressed due to low confidence (2)

src/llm/apis/openai_completions.cpp:826

  • For RESPONSES requests, the validation error message says "max_tokens value should be greater than 0", but the endpoint-specific fields are max_output_tokens / max_completion_tokens. Updating the message to reference the actual accepted field(s) will make it clearer for API consumers.
    // specific part of max_tokens validation
    if (request.maxTokens == 0) {
        return absl::InvalidArgumentError("max_tokens value should be greater than 0");
    }

src/llm/servable.cpp:76

  • The invalid-endpoint error message lists /v3/tokenize, but TokenizeParser::isTokenizeEndpoint() accepts any URI that ends with tokenize (including /v3/v1/tokenize). Consider listing both /v3/tokenize and /v3/v1/tokenize (consistent with the other endpoints) or wording it as a suffix-based rule to avoid misleading users.
        return absl::InvalidArgumentError("Wrong endpoint. Allowed endpoints: /v3/chat/completions, /v3/completions, /v3/responses, /v3/tokenize");

Comment on lines +784 to +821
std::optional<uint32_t> maxCompletionTokens;
std::optional<uint32_t> maxOutputTokens;

// max_completion_tokens: uint; optional
it = doc.FindMember("max_completion_tokens");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
    if (!it->value.IsUint()) {
        if (it->value.IsUint64())
            return absl::InvalidArgumentError("max_completion_tokens value can't be greater than 4294967295");
        return absl::InvalidArgumentError("max_completion_tokens is not an unsigned integer");
    }
    if (maxTokensLimit.has_value() && it->value.GetUint() > maxTokensLimit.value())
        return absl::InvalidArgumentError(absl::StrCat("max_completion_tokens exceeds limit provided in graph config: ", maxTokensLimit.value()));
    maxCompletionTokens = it->value.GetUint();
}

// max_output_tokens: uint; optional
// OpenAI Responses API uses this field for output token limit.
it = doc.FindMember("max_output_tokens");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
    if (!it->value.IsUint()) {
        if (it->value.IsUint64())
            return absl::InvalidArgumentError("max_output_tokens value can't be greater than 4294967295");
        return absl::InvalidArgumentError("max_output_tokens is not an unsigned integer");
    }
    if (maxTokensLimit.has_value() && it->value.GetUint() > maxTokensLimit.value())
        return absl::InvalidArgumentError(absl::StrCat("max_output_tokens exceeds limit provided in graph config: ", maxTokensLimit.value()));
    maxOutputTokens = it->value.GetUint();
}

if (maxCompletionTokens.has_value() && maxOutputTokens.has_value() && maxCompletionTokens.value() != maxOutputTokens.value()) {
    return absl::InvalidArgumentError("max_output_tokens and max_completion_tokens must match when both are provided");
}
if (maxOutputTokens.has_value()) {
    request.maxTokens = maxOutputTokens.value();
} else if (maxCompletionTokens.has_value()) {
    request.maxTokens = maxCompletionTokens.value();
}
Copilot AI Mar 4, 2026

OpenAIChatCompletionsRequest::maxTokens is std::optional<int>, but in RESPONSES parsing you accept max_output_tokens/max_completion_tokens as uint32_t and then assign into request.maxTokens. Values > INT_MAX will overflow / become implementation-defined. Consider changing maxTokens to an unsigned/wider type (e.g., uint32_t/uint64_t) or clamp + validate against std::numeric_limits<int>::max() before assigning.
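A guard along these lines could address the overflow concern. The sketch below is illustrative only: `assignTokenLimit` is a hypothetical helper name, and only the fact that `request.maxTokens` is `std::optional<int>` is taken from the comment above.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical helper: validate a parsed uint32_t field before assigning it
// into an std::optional<int> target, rejecting values that would overflow int.
// Returns an error message on failure, std::nullopt on success.
std::optional<std::string> assignTokenLimit(uint32_t parsed, const std::string& fieldName,
                                            std::optional<int>& target) {
    if (parsed > static_cast<uint32_t>(INT_MAX)) {
        return fieldName + " value can't be greater than " + std::to_string(INT_MAX);
    }
    target = static_cast<int>(parsed);
    return std::nullopt;  // success
}
```

Alternatively, widening `maxTokens` to `uint32_t` would remove the need for the check entirely.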

Comment on lines +137 to +191
if (request.maxTokens.has_value()) {
    writer.String("max_output_tokens");
    writer.Int(request.maxTokens.value());
}

writer.String("output");
writer.StartArray();
int outputIndex = 0;
for (const auto& parsedOutput : parsedOutputs) {
    const std::string outputId = "msg-" + std::to_string(outputIndex++);

    writer.StartObject();
    writer.String("id");
    writer.String(outputId.c_str());
    writer.String("type");
    writer.String("message");
    writer.String("role");
    writer.String("assistant");
    writer.String("status");
    writer.String("completed");
    writer.String("content");
    writer.StartArray();
    writer.StartObject();
    writer.String("type");
    writer.String("output_text");
    writer.String("text");
    writer.String(parsedOutput.content.c_str());
    writer.String("annotations");
    writer.StartArray();
    writer.EndArray();
    writer.EndObject();
    writer.EndArray();
    writer.EndObject();
}
writer.EndArray();

writer.String("usage");
writer.StartObject();
writer.String("input_tokens");
writer.Int(usage.promptTokens);
writer.String("input_tokens_details");
writer.StartObject();
writer.String("cached_tokens");
writer.Int(0);
writer.EndObject();
writer.String("output_tokens");
writer.Int(usage.completionTokens);
writer.String("output_tokens_details");
writer.StartObject();
writer.String("reasoning_tokens");
writer.Int(0);
writer.EndObject();
writer.String("total_tokens");
writer.Int(usage.calculateTotalTokens());
writer.EndObject();
Copilot AI Mar 4, 2026

serializeResponsesUnaryResponse() writes numeric fields with writer.Int(...) (e.g., max_output_tokens, input_tokens, output_tokens, total_tokens). Several of these values come from size_t or may exceed signed 32-bit range, which can overflow and produce incorrect JSON. Prefer Int64/Uint64 (or Uint) for token counters/limits to match the validated range and the underlying types.

Collaborator

@michalkulakowski can you protect against such corner cases?

Comment on lines +106 to +108
const auto createdAt = std::chrono::duration_cast<std::chrono::seconds>(created.time_since_epoch()).count();
const std::string responseId = "resp-" + std::to_string(createdAt);

Copilot AI Mar 4, 2026

responseId is derived only from createdAt in seconds ("resp-" + std::to_string(createdAt)), which can easily collide under concurrent load (multiple requests within the same second). Use a higher-resolution timestamp and/or add a per-process counter/UUID component to keep response IDs unique.
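One possible shape for a collision-resistant ID, sketched with the standard library only (the function name and ID format below are hypothetical, not the PR's implementation):

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <string>

// Combine a microsecond-resolution timestamp with a per-process atomic
// counter, so two requests created in the same instant still get distinct IDs.
std::string generateResponseId() {
    static std::atomic<uint64_t> counter{0};
    const auto now = std::chrono::system_clock::now().time_since_epoch();
    const auto micros = std::chrono::duration_cast<std::chrono::microseconds>(now).count();
    return "resp-" + std::to_string(micros) + "-" + std::to_string(counter.fetch_add(1));
}
```

A UUID component would make the IDs globally unique across processes as well, at the cost of a dependency.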

@michalkulakowski michalkulakowski force-pushed the mkulakow/responses_api branch 2 times, most recently from ea72634 to 299e2be Compare March 5, 2026 17:42
Comment on lines +122 to +123
:::{dropdown} **Unary call with curl using responses endpoint**
**Note**: using urls in request requires `--allowed_media_domains` parameter described [here](../../../docs/parameters.md)
Collaborator

Suggested change:
- :::{dropdown} **Unary call with curl using responses endpoint**
- **Note**: using urls in request requires `--allowed_media_domains` parameter described [here](../../../docs/parameters.md)
+ :::{dropdown} **Unary call with cURL using Responses API**
+ **Note**: Using URLs in a request requires the `--allowed_media_domains` parameter described [here](../../../docs/parameters.md).

**Note**: using urls in request requires `--allowed_media_domains` parameter described [here](../../../docs/parameters.md)

```bash
curl http://localhost:8000/v3/responses -H "Content-Type: application/json" -d "{ \"model\": \"OpenGVLab/InternVL2-2B\", \"input\":[{\"role\": \"user\", \"content\": [{\"type\": \"input_text\", \"text\": \"Describe what is on the picture.\"},{\"type\": \"input_image\", \"image_url\": \"http://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/static/images/zebra.jpeg\"}]}], \"max_output_tokens\": 100}"
Collaborator

Can you break down the command into multiple lines to showcase what is actually being sent?

:::


### Unary calls to responses endpoint using cURL
Collaborator

Suggested change:
- ### Unary calls to responses endpoint using cURL
+ ### Unary calls via Responses API using cURL


:::

:::{dropdown} **Streaming request with OpenAI client using responses endpoint**
Collaborator

Suggested change:
- :::{dropdown} **Streaming request with OpenAI client using responses endpoint**
+ :::{dropdown} **Streaming request with OpenAI client via Responses API**


## Benchmarking text generation with high concurrency

OpenVINO Model Server employs efficient parallelization for text generation. It can be used to generate text also in high concurrency in the environment shared by multiple clients.
Collaborator

Why is it being added in responses api PR? Is it related?

writer.String("input_tokens_details");
writer.StartObject();
writer.String("cached_tokens");
writer.Uint64(0);
Collaborator

Please add a note here that it is not supported.

writer.String("output_tokens_details");
writer.StartObject();
writer.String("reasoning_tokens");
writer.Uint64(0);
Collaborator

can we support it in this release?


if (inputIt->value.IsString()) {
    request.prompt = inputIt->value.GetString();
    if (!request.prompt.has_value() || request.prompt.value().empty()) {
Collaborator

request.prompt must have value if you set it one line earlier

    }
    request.imageHistory.push_back({i, tensor});
} else {
    return absl::InvalidArgumentError("Unsupported content type");
Collaborator

Can you include what the content type is? This way we can easily see in the logs what content type the user provided versus what is supported.

if (existingMessages != doc.MemberEnd()) {
    existingMessages->value = messages;
} else {
    doc.AddMember("messages", messages, allocator);
Collaborator

Why is the document being edited? If I understand correctly, we parse the doc into our request format. Why do we need to edit the doc? Shouldn't it be left untouched, with subsequent steps working only on the request? Why do we parse it if we don't use it later?

auto& obj = it->value.GetArray()[i];
if (!obj.IsObject())
    return absl::InvalidArgumentError("Tool is not a JSON object");
rapidjson::Value* functionObj = nullptr;
Collaborator

functionObj is never used outside this scope; you don't need to work on pointers here, contrary to parametersValue where it is required.

parametersValue->Accept(writer);
std::string parametersStr = buffer.GetString();
ToolSchemaWrapper schemaReprs{parametersValue, std::move(parametersStr)};
request.toolNameSchemaMap[functionNameCStr] = std::move(schemaReprs);
Collaborator

Suggested change:
- request.toolNameSchemaMap[functionNameCStr] = std::move(schemaReprs);
+ request.toolNameSchemaMap[functionName] = std::move(schemaReprs);

if (it->value.GetUint() == 0)
    return absl::InvalidArgumentError("n value should be greater than 0");
if (endpoint == Endpoint::RESPONSES && request.stream && it->value.GetUint() > 1)
    return absl::InvalidArgumentError("n greater than 1 is not supported for responses streaming");
Collaborator

Suggested change:
- return absl::InvalidArgumentError("n greater than 1 is not supported for responses streaming");
+ return absl::InvalidArgumentError("n greater than 1 is not supported for Responses API streaming");

if (endpoint == Endpoint::RESPONSES) {
    std::vector<ParsedOutput> parsedOutputs;
    for (const auto& tokens : results.tokens) {
        updateUsage(usage, tokens, request.echo);
Collaborator

We just recently switched to using PerfMetrics @mzegla
Why cant we use it here?

events.emplace_back(serializeResponsesEvent([this, &responseId, createdAt](Writer<StringBuffer>& writer) {
    writer.StartObject();
    writer.String("type");
    writer.String("response.created");
Collaborator

This might not be a good place for this event to be streamed; please verify against other serving implementations in case the OpenAI spec is not specific enough.

Collaborator

and all other events

Collaborator

@mzegla mzegla left a comment

It's a big change, let's review it iteratively. Pushing comments only to the logic - did not review tests just yet.

throw std::invalid_argument("Unsupported JSON value type");
}

std::string serializeResponsesEvent(const std::function<void(Writer<StringBuffer>&)>& eventSerializer) {
Collaborator

How does it work? Does eventSerializer carry the data itself? It's not clear to me what the output string would be.
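For what it's worth, the pattern seems to be that the lambda carries the event data in its captures and writes the body, while the function owns buffer setup and extracts the final string. A simplified stdlib analogue of that pattern (using std::ostringstream in place of rapidjson's Writer; names are illustrative, not the PR's code):

```cpp
#include <cassert>
#include <functional>
#include <sstream>
#include <string>

// Simplified analogue of the lambda-based serializer pattern: the callback
// carries the event-specific data in its captures and writes the payload;
// this function owns the buffer and returns the finished string.
std::string serializeEvent(const std::function<void(std::ostream&)>& eventSerializer) {
    std::ostringstream buffer;
    eventSerializer(buffer);
    return buffer.str();
}
```

Here the output string is simply whatever the callback wrote into the buffer.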

Comment on lines +106 to +120
void serializeNotSupportedNullField(Writer<StringBuffer>& writer, const char* fieldName) {
writer.String(fieldName);
writer.Null();
}

void serializeNotSupportedZeroField(Writer<StringBuffer>& writer, const char* fieldName) {
writer.String(fieldName);
writer.Uint64(0);
}

void serializeNotSupportedEmptyArrayField(Writer<StringBuffer>& writer, const char* fieldName) {
writer.String(fieldName);
writer.StartArray();
writer.EndArray();
}
Collaborator

why not supported?

Comment on lines +157 to +158
const char* status, const std::string& fullOutputText, bool includeUsage,
const char* incompleteReason, const char* errorMessage, const char* errorCode) const {
Collaborator

Lots of parameters. Can we use std::strings and get const char* internally, so it's more straightforward to use?
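As an illustration of the suggestion, the parameters could be bundled into a small options struct with std::string members and defaults (all names below are hypothetical, loosely mirroring the signature above):

```cpp
#include <cassert>
#include <string>

// Hypothetical options struct bundling the serializer's parameters, with
// std::string members so call sites don't juggle const char* lifetimes.
struct ResponseObjectOptions {
    std::string status = "completed";
    std::string fullOutputText;
    bool includeUsage = true;
    std::string incompleteReason;
    std::string errorMessage;
    std::string errorCode;
};

// A function taking the struct reads clearly at the call site and lets
// callers set only the fields they care about.
std::string describe(const ResponseObjectOptions& opts) {
    return opts.status + (opts.includeUsage ? "+usage" : "");
}
```

Call sites then name the fields they set, instead of passing six positional arguments.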

std::string serializeStreamingUsageChunk();
std::string serializeStreamingHandshakeChunk();
std::string serializeResponsesStreamingInitEvents();
std::string serializeResponsesFailedEvent(const std::string& errorMessage, const char* errorCode = "server_error");
Collaborator

error code as string? maybe we could have some enum here
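A sketch of the enum idea; `server_error` is taken from the declaration above, while the other values are illustrative placeholders:

```cpp
#include <cassert>
#include <string>

// Hypothetical enum for Responses API error codes, replacing raw strings so
// typos are caught at compile time.
enum class ResponsesErrorCode {
    ServerError,
    RateLimitExceeded,
    InvalidPrompt,
};

// Single mapping point from the enum to the wire-format string.
const char* toString(ResponsesErrorCode code) {
    switch (code) {
        case ResponsesErrorCode::ServerError: return "server_error";
        case ResponsesErrorCode::RateLimitExceeded: return "rate_limit_exceeded";
        case ResponsesErrorCode::InvalidPrompt: return "invalid_prompt";
    }
    return "server_error";  // fallback for unhandled values
}
```

The serializer would then take `ResponsesErrorCode` and call `toString()` only at the JSON boundary.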

Comment on lines +179 to +194
if (executionContext->apiHandler && executionContext->apiHandler->isStream() && executionContext->endpoint == Endpoint::RESPONSES) {
    std::string failedEvent = executionContext->apiHandler->serializeResponsesFailedEvent(e.what());
    executionContext->response = wrapTextInServerSideEventMessage(failedEvent);
    executionContext->response += wrapTextInServerSideEventMessage("[DONE]");
    cc->Outputs().Tag(OUTPUT_TAG_NAME).Add(new std::string{std::move(executionContext->response)}, iterationBeginTimestamp);
    return absl::OkStatus();
}
return absl::InvalidArgumentError(e.what());
} catch (...) {
    if (executionContext->apiHandler && executionContext->apiHandler->isStream() && executionContext->endpoint == Endpoint::RESPONSES) {
        std::string failedEvent = executionContext->apiHandler->serializeResponsesFailedEvent("Response generation failed");
        executionContext->response = wrapTextInServerSideEventMessage(failedEvent);
        executionContext->response += wrapTextInServerSideEventMessage("[DONE]");
        cc->Outputs().Tag(OUTPUT_TAG_NAME).Add(new std::string{std::move(executionContext->response)}, iterationBeginTimestamp);
        return absl::OkStatus();
    }
Collaborator

Are you sure that it will be handled correctly with OK status? I mean for example, will it exit properly?

Collaborator Author

Yes, in Responses API streaming, as opposed to chat/completions, error handling is realized through a special event rather than a status code.

Comment on lines +133 to +139
std::string initEvents = executionContext->apiHandler->serializeResponsesStreamingInitEvents();
if (!initEvents.empty()) {
    executionContext->response = wrapTextInServerSideEventMessage(initEvents);
    cc->Outputs().Tag(OUTPUT_TAG_NAME).Add(new std::string{std::move(executionContext->response)}, iterationBeginTimestamp);
    executionContext->response = "";
}
cc->Outputs().Tag(LOOPBACK_TAG_NAME).Add(new bool{true}, iterationBeginTimestamp);
Collaborator

Is that whole thing okay? You schedule and don't read/check the output before returning. Aren't we at risk of skipping a chunk, especially if it's generated fast?

Collaborator Author

Yes, because here we emit Responses API events that don't contain data generated by the pipeline.

return false;
}

static py::object rapidJsonValueToPyObject(const rapidjson::Value& value) {
Collaborator

Is that whole thing necessary? Why don't we serialize to string -> pass to python -> read to json?

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 3 comments.

Comment on lines +2040 to +2056
std::string OpenAIChatCompletionsHandler::serializeResponseCreatedEvent(const std::string& responseId, int64_t createdAt) {
    StringBuffer buffer;
    Writer<StringBuffer> writer(buffer);
    writeEventHeader(writer, "response.created");
    writer.String("response");
    serializeResponsesResponseObject(writer, responseId, createdAt, "in_progress", "", false);
    writer.EndObject();
    return buffer.GetString();
}

std::string OpenAIChatCompletionsHandler::serializeResponseInProgressEvent(const std::string& responseId, int64_t createdAt) {
    StringBuffer buffer;
    Writer<StringBuffer> writer(buffer);
    writeEventHeader(writer, "response.in_progress");
    writer.String("response");
    serializeResponsesResponseObject(writer, responseId, createdAt, "in_progress", "", false);
    writer.EndObject();
Copilot AI Mar 23, 2026

response.created / response.in_progress events serialize a full response object via serializeResponsesResponseObject(), which currently always emits an output array (and may include a placeholder message item). This conflicts with the subsequent emission of response.output_item.added / response.content_part.added events and can lead to duplicated/contradictory stream semantics. Consider making created/in_progress events omit output entirely (or emit an empty output array) and rely on the dedicated output_item/content_part events for initialization.

Comment on lines +341 to +345
// TODO: error not supported in unary response
writer.String("model");
writer.String(request.model.c_str());
writer.String("status");
writer.String(responseStatus.c_str());
Copilot AI Mar 23, 2026

serializeResponsesUnaryResponse() does not emit several spec-aligned fields that the new unit tests expect (e.g. "error":null, "previous_response_id":null, "reasoning":null, "user":null). This will cause the added tests (e.g. serializeUnaryResponseForResponses*) to fail and makes unary and streaming response objects structurally inconsistent. Reuse serializeResponsesResponseObject() for unary responses or explicitly add the missing null fields so the unary payload matches the Responses API schema.

}

std::string OpenAIChatCompletionsHandler::serializeResponsesStreamingInitEvents() {
    const auto createdAt = std::chrono::duration_cast<std::chrono::microseconds>(created.time_since_epoch()).count();
Copilot AI Mar 23, 2026

Responses streaming uses createdAt in microseconds (duration_cast<microseconds>) and passes it to created_at in the response object. This makes created_at inconsistent with unary responses (seconds) and can even be larger than completed_at (which is seconds), breaking ordering and likely violating the OpenAI Responses API expectations. Use a consistent unit (typically epoch seconds) for created_at/IDs across unary and streaming, and ensure completed_at is in the same unit family.

Suggested change:
- const auto createdAt = std::chrono::duration_cast<std::chrono::microseconds>(created.time_since_epoch()).count();
+ const auto createdAt = std::chrono::duration_cast<std::chrono::seconds>(created.time_since_epoch()).count();

+ is_fc_model=True,
+ underscore_to_dot=True,
+ ),
+ "ovms-model-stream-responses": ModelConfig(
Collaborator

maybe ovms-model-responses-stream?
more straightforward switching

}
#else
ov::genai::ChatHistory& chatHistory = executionContext->apiHandler->getChatHistory();
constexpr bool add_generation_prompt = true;
Collaborator

Suggested change:
- constexpr bool add_generation_prompt = true;
+ constexpr bool addGenerationPrompt = true;

?

writer.String(toolCall.arguments.c_str());
writer.EndObject();
}
if (!fullOutputText.empty() || responsesState.toolCalls.empty()) {
Collaborator

If we assume that when there were tool calls there is no other content, we should document it, for example as a comment here, to be explicit.

}
}

request.chatHistory.last()["content"] = contentText;

Comment on lines +1922 to +1927
if (input_ids.get_shape().size() != 2)
    throw std::runtime_error("input_ids should have 2 dimensions");
if (input_ids.get_shape()[0] != 1)
    throw std::runtime_error("input_ids should have 1 batch size");
if (input_ids.get_element_type() != ov::element::i64)
    throw std::runtime_error("input_ids should have i64 element type");
Collaborator

Do users see those errors, or are they only in our DEBUG logging? The user does not see input_ids, so it's strictly debugging information.

Collaborator Author

I assume these exceptions should never happen, but we have the same check in the chat completions part a few lines below.


int64_t* inputIdsData = reinterpret_cast<int64_t*>(input_ids.data());
std::vector<int64_t> generatedTokens(inputIdsData, inputIdsData + input_ids.get_shape()[1]);
parsedOutputs.push_back(parseOutputIfNeeded(generatedTokens));
Collaborator

Does parseOutputIfNeeded strictly require tokens? Is that why we need to do this entire operation here?

Collaborator Author

exactly

const auto& toolCallsArr = deltaObj["tool_calls"];
for (rapidjson::SizeType i = 0; i < toolCallsArr.Size(); ++i) {
    const auto& tc = toolCallsArr[i];
    int tcIndex = tc.HasMember("index") ? tc["index"].GetInt() : 0;
Collaborator

Is it possible that for two iterations in the loop we get to have tcIndex = 0 here and overwrite?

Collaborator Author

I assume that it's not possible, since BaseOutputParser always includes id in wrapDelta methods.

Comment on lines +172 to +173
writer.String("parallel_tool_calls");
writer.Bool(false);
Collaborator

We have support for parallel tool calls, why is it hardcoded to false here?

if (!responsesState.reasoningText.empty()) {
    writer.StartObject();
    writer.String("id");
    writer.String("rs-0");
Collaborator

What is rs-0? Is it duplicated in a multi-turn conversation, or is it for cases where the model mixes reasoning and content multiple times within one unary response?

writer.StartObject();
writer.String("input_tokens");
writer.Uint64(static_cast<uint64_t>(usage.promptTokens));
// TODO: input_tokens_details.cached_tokens not supported
Collaborator

There is a table for the completions and chat completions APIs showing what we support and what we don't. Is there a chance we could have the same for the Responses API? This way, in the future, we will be able to talk about it clearly, see what is missing, and estimate next tasks correctly. I'm just afraid people will assume something is done when in fact it's not.

} else {
    return absl::InvalidArgumentError("input is not a string or array");
}


auto kwargsIt = doc.FindMember("chat_template_kwargs");
if (kwargsIt == doc.MemberEnd()) {
    rapidjson::Value kwargs(rapidjson::kObjectType);
    kwargs.AddMember("enable_thinking", true, doc.GetAllocator());
Collaborator

Barely any model really uses this to enable thinking. Qwen3-VL, for example, doesn't. Is it an issue? @michalkulakowski @dtrawins

std::stringstream ss;
ss << events.front();
for (size_t i = 1; i < events.size(); ++i) {
    ss << "\n\ndata: " << events[i];
Collaborator

@dkalinowski dkalinowski Mar 23, 2026

There's already a util for that, why can't we reuse it here? wrapTextInServerSideEventMessage
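For illustration, assuming the util has roughly the shape `std::string wrapTextInServerSideEventMessage(const std::string&)` and produces `data: <payload>` frames terminated by a blank line (the actual signature in the codebase may differ), the loop could collapse to:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Assumed shape of the existing util: SSE frames are "data: <payload>\n\n".
std::string wrapTextInServerSideEventMessage(const std::string& text) {
    return "data: " + text + "\n\n";
}

// Joining events through the util instead of hand-building the separators.
std::string joinEvents(const std::vector<std::string>& events) {
    std::string out;
    for (const auto& event : events) {
        out += wrapTextInServerSideEventMessage(event);
    }
    return out;
}
```

This keeps the SSE framing logic in one place, so a change to the wire format touches only the util.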

4 participants